Abstract

In this work, we propose a novel video representation
for activity recognition that models video dynamics with
attributes of activities. A video sequence is decomposed
into short-term segments, which are characterized by the
dynamics of their attributes. These segments are modeled
by a dictionary of attribute dynamics templates, which are
implemented by a recently introduced generative model,
the binary dynamic system (BDS). We propose methods for
learning a dictionary of BDSs from a training corpus, and
for quantizing attribute sequences extracted from videos
into these BDS codewords. This procedure produces a
representation of the video as a histogram of BDS codewords,
which is denoted the bag-of-words for attribute dynamics
(BoWAD). An extensive experimental evaluation reveals
that this representation outperforms other state-of-the-art
approaches in temporal structure modeling for complex
activity recognition.

1.  Introduction 

The recognition of human activities and events is an
important problem for computer vision. Two lines of research
have received substantial attention in this area. The
first, motivated by the fact that an activity is naturally
defined by an ordered set of short-term behaviors, aims to
model the temporal composition of activities. This is usually
done with low-level video representations. In fact,
many methods have been proposed to model the temporal
structure of low-level features extracted from video, e.g.,
histograms of spatiotemporal filter responses. This includes
both discriminative [11, 16, 7, 25] and generative models
[12, 9, 4]. The second, inspired by recent advances in
image analysis, is to represent activities as collections of
semantic attributes [15, 23, 22, 6]. This entails an intermediate
level of representation, where features are no longer
visual, but identifiers of the occurrence of semantic concepts
of interest, such as scene types, actions, objects, etc.
This higher level of abstraction enables better generalization,
facilitates semantic and contextual reasoning, and enables
knowledge transfer from well-understood examples to
unseen instances.

Figure 1: Challenges in modeling the dynamics of attributes of
complex activities. (Top) YouTube video sequence annotated with
“tennis-serve” activity. (Bottom) associated trajectory on a 3D
attribute space (red for “arm-motion”, green for “foot motion” and
blue for “ball motion”). Note the complexity of the trajectory and
the fact that only a short segment (red-shaded) is a staple of the
action of interest.

Advances along these two directions are complementary.
While a detailed characterization of the temporal structure
on top of low-level features is, in general, insufficient to
characterize complex activities, the representation of video
as an orderless set of attributes is incapable of fine-grained
activity discrimination (i.e., distinguishing between activities
which express the same attributes in different orders).
Recently, [14] has proposed to unify the two research
directions, by modeling the temporal structure of the video
projection in an attribute space. This was implemented by
introducing a dynamic model, denoted binary dynamic system
(BDS), which extends classical linear dynamic systems
to binary observation spaces. While this model has been
shown to achieve state-of-the-art performance in standard
benchmarks, it does not address two of the most significant
challenges in the recognition of complex activities. The first
is that such video rarely contains only the event of interest.
In general, video sequences are only annotated with respect
to a dominant event, or high-level subject, and not with
respect to the footage that either precedes or trails it. The
second is that a single model, such as the BDS, is unlikely to
provide a good fit to the complex attribute space trajectories
produced by the video. This is illustrated in Figure 1, which
presents the trajectory of a video of the “tennis-serve” activity
in a space spanned by three closely-related attributes.

In this work, we propose to address these limitations
with a new video representation, which is denoted the
bag-of-words for attribute dynamics (BoWAD). This is an
extension of the bag-of-visual words (BoVW), which has
achieved great popularity for image classification [28]. Like
the BoVW, the BoWAD is a histogram with respect to
a dictionary of templates. However, rather than templates
of visual appearance, it relies on templates of attribute
dynamics. These templates are in fact generative models and,
more precisely, temporally localized BDSs. In this way,
an activity is represented as a collection of characteristic
short-term behaviors, and no single BDS needs to model
unduly complex attribute trajectories. We propose a procedure
for learning a dictionary of BDSs, and for quantizing video
with respect to this dictionary, and show that the representation
achieves performance superior to that of state-of-the-art
approaches to temporal structure modeling on challenging
datasets.

2.  Related  Work 

Over the last decade, the bag-of-features (BoF) has
become a popular video representation for action recognition
[21]. This consists of representing video as a collection
of feature vectors. Several models exploiting the temporal
structure of activities are based on this representation. For
example, Laptev et al. [11] used a spatio-temporal binning
pyramid to match vector-quantized histograms from different
video regions. Niebles et al. [16] and Gaidon et al. [7]
represented an activity with a small number of decomposable
parts or atomic actions. Alternatives based on generative
models have also been proposed. Laxton et al. [12]
integrated confidences about objects and sub-actions over
time, with dynamic Bayesian networks. Finally, dynamic
systems have been used to represent the evolution of human
activity, using different features (local binary patterns [9],
tracked parts [13], or frame-wise motion histograms [4]).

Recently, image analysis research has shown that semantics
or attribute-based representations can have substantial
benefits over BoF, including better generalization
and support for contextual reasoning [19, 10, 18, 20]. This
has motivated the application of these representations to
action recognition. For example, Liu et al. [15] proposed
the use of attributes as latent variables for support
vector machines (SVMs) to recognize actions. Sadanand
and Corso [23] have shown substantial improvements over
standard benchmarks by using a bank of action detectors
sampled broadly across semantic and viewpoint spaces.
Rohrbach et al. [22] augmented video with text-script data
and modeled activities as common sets of attributes, defined
in terms of basic actions and objects. Finally, Li and
Vasconcelos [14] introduced a model (BDS) of the temporal
structure of attributes. This work suggests that the modeling
of video trajectories in attribute space is crucial for the
fine-grained understanding of human behavior.

In this work, we expand on the idea of [14], by learning
dictionaries of models for attribute dynamics. This is
related to the bag-of-systems framework of [21, 1], where a
set of dynamic textures (DTs) [5] were used to characterize
dynamic scenes. The main challenge of this dictionary learning
problem is the difficulty of identifying the “centroid” of
a collection of dynamic textures, due to the non-Euclidean
nature of the space of linear dynamic systems (LDSs). [21]
bypasses this problem by resorting to a somewhat heuristic
combination of multi-dimensional scaling and k-means
(denoted MDS-kM). While [1] presents a procedure to directly
average LDSs in the parameter space, that approach only
works for LDSs. We propose an alternative principled solution,
which is specifically designed for clustering attribute
sequences, and has a number of advantages over MDS-kM.
These are shown to result in superior recognition accuracy.

3.  The  Bag  of  Words  for  Attribute  Dynamics 

In this section, we introduce a new representation for
activity recognition, denoted the bag-of-words for attribute
dynamics (BoWAD).

3.1.  Words  and  Attributes 

A popular representation for image classification is the
bag of visual words (BoVW) [28], which has recently also
become popular for action recognition [2  ]. This consists of
representing an image as a BoF, learning a dictionary of
representative feature vectors, which are denoted visual words,
and using this dictionary to quantize the features extracted
from an image to classify. The BoVW is the resulting histogram
of visual word counts. This is frequently used as a
feature vector for image or video classification. Despite the
popularity of the BoVW, several works have demonstrated
the benefits of alternative feature spaces, which encode
higher-level semantics by representing images or video as
collections of binary attributes [19, 10, 18, 20, 15, 14].

Under this representation, activities are defined with respect
to a set of $K$ attributes $\mathcal{C} = \{c_i\}_{i=1}^{K}$, inferred from
video frames by a bank of attribute classifiers $\{\pi_i\}_{i=1}^{K}$.
Possible attributes include scene classes, objects, atomic actions,
human-object interactions, etc. A video $v \in \mathcal{X}$ is
mapped into attribute space $\mathcal{S}$ by a mapping

$\pi : \mathcal{X} \to \mathcal{S} = [0,1]^K$,  (1)

where

$\pi(v) = (\pi_1(v), \ldots, \pi_K(v))^T$  (2)

is an attribute score vector. Component $\pi_i(v)$ is a confidence
score quantifying the presence of the $i$-th attribute
in $v$. In this work, these scores are the posterior probabilities
$\pi_c(v) = p(c \mid v)$ of attribute $c$ given some low-level
representation of video $v$, e.g., a BoF histogram of
spatio-temporal descriptors.

Figure 2: Learning a BDS. Video sequences (left) are encoded as trajectories in attribute space $\mathcal{S}$ (center). Sequences of similar semantics
span similar trajectories. The BDS $\Omega$ embeds a video trajectory into a low-dimensional space (shown in green), by binary PCA, and learns
a Gauss-Markov process that describes the corresponding trajectory in the latent state space (right).

3.2.  Attribute-based  Activity  Recognition 

In [15] a vector of attribute scores $\pi(v)$ is computed for
the whole video sequence $v$. This holistic attribute representation
disregards the temporal structure of the different
attributes. While it can distinguish activities that lie
on different regions of $\mathcal{S}$, it cannot disambiguate activities
that contain similar attributes but with different temporal
structure. This problem can be overcome by applying
the attribute classifiers to video segments $v_t$ extracted
with a sliding window. As illustrated in Figure 2, this
produces a sequence of attribute score vectors $\{\pi_t\}_{t=1}^{\tau}$, where
$\pi_t = \pi(v_t)$. In summary, a video sequence is modeled as
a trajectory in $\mathcal{S}$ and sequences of similar semantics span
similar trajectories.

Li and Vasconcelos proposed to model a video trajectory
in $\mathcal{S}$ with a binary dynamic system (BDS) [14], defined by

$x_{t+1} = A x_t + v_t$,  (3a)

$y_t \sim \mathcal{B}(y; \sigma(C x_t + u))$,  (3b)

where $x_t \in \mathbb{R}^L$ ($L$ is the dimension of the latent space) and
$y_t \in \{0,1\}^K$ are state and observation variables; $u \in \mathbb{R}^K$ a
bias term; $A \in \mathbb{R}^{L \times L}$ a state transition matrix; $C \in \mathbb{R}^{K \times L}$
an observation matrix; $v_t \sim \mathcal{N}(0, Q)$ a state noise process;
$x_1 = \mu_0 + v_0 \sim \mathcal{N}(\mu_0, S_0)$ an initial condition;
$\mathcal{B}(y; p)$ a multivariate Bernoulli distribution of parameter
$p \in [0,1]^K$; and $\sigma(\theta)$ a component-wise logistic transformation,
i.e., $\sigma_i(\theta) = (1 + e^{-\theta_i})^{-1}$. The observation model
of (3b) can be interpreted as a binary principal component
analysis (binary PCA) [24] of $\{y_t\}$. Binary PCA is a
dimensionality reduction technique for binary data. Given a
matrix $Y = [y_1, \ldots, y_\tau] \in \{0,1\}^{K \times \tau}$, it determines an
$L$-dimensional ($L \ll K$) embedding of the natural parameters
$\Theta$ of the Bernoulli distribution, by maximizing the
log-likelihood

$\mathcal{L} = \log p(Y; \Theta) = \log \prod_{k,t} \sigma(\Theta_{kt})^{Y_{kt}} \, \sigma(-\Theta_{kt})^{1 - Y_{kt}}$  (4)

subject to the constraint

$\Theta = C X + u \mathbf{1}^T$,  (5)

where $C \in \mathbb{R}^{K \times L}$, $X = [x_1, \ldots, x_\tau] \in \mathbb{R}^{L \times \tau}$, $u \in \mathbb{R}^K$,
and $\mathbf{1} \in \mathbb{R}^{\tau}$ is the vector of all ones. Each column of $C$ is
a basis vector of a latent subspace and the $t$-th column of
$X$ contains the coordinates of $y_t$ in this basis (up to a
translation by $u$).

Since, in the context of attribute representations, only
the attribute scores $\pi_t$ (and not the attribute variables
themselves) are known, [14] replaced the log-likelihood of (4)
by the expected log-likelihood

$E_Y[\mathcal{L}] = \sum_{k,t} \left[ \pi_{kt} \log \sigma(\Theta_{kt}) + (1 - \pi_{kt}) \log \sigma(-\Theta_{kt}) \right]$.  (6)
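As a concrete reading of (5) and (6), the following minimal sketch evaluates the expected log-likelihood for given parameters. It is an illustration under stated assumptions rather than the implementation of [14] or [24]: the function names and the array layout (attributes along rows, time along columns) are hypothetical choices, and the optimization over $C$, $X$ and $u$ is not shown.

```python
import numpy as np

def log_sigmoid(theta):
    # Numerically stable log(sigma(theta)) = -log(1 + exp(-theta)).
    return -np.logaddexp(0.0, -theta)

def expected_log_likelihood(Pi, C, X, u):
    """Expected log-likelihood of Eq. (6).

    Pi : (K, tau) attribute scores in [0, 1]
    C  : (K, L)   observation matrix
    X  : (L, tau) latent state sequence
    u  : (K,)     bias term
    """
    Theta = C @ X + u[:, None]  # natural parameters, constraint of Eq. (5)
    terms = Pi * log_sigmoid(Theta) + (1.0 - Pi) * log_sigmoid(-Theta)
    return float(terms.sum())
```

When the scores in `Pi` are exactly 0 or 1, this quantity coincides with the log-likelihood of (4).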

The maximization of (6) under the constraint of (5) can be
performed with an expectation-maximization (EM)-like iterative
algorithm [24], which produces estimates of the parameters
$C$, $u$ and the latent sequence $X$. [14] exploited this
to propose a BDS extension of the popular dynamic texture
algorithm for learning linear dynamic systems [5, 2]. Given
a sample $\{\pi_t\}_{t=1}^{\tau}$, this consists of learning the
observation and state transition models in two steps. The first
is a binary PCA analysis of $\{\pi_t\}$ to determine $C$, $u$, and
the coefficients $\{x_t\}$. As shown in Figure 2, $\{x_t\}$ is a
trajectory in the state space, which follows a Gauss-Markov
process. The second step determines the matrix $A$ that provides
the least squares fit to these coefficients. Note that
this matrix characterizes the state space trajectory, which
is mapped (given $C$ and $u$) into the video trajectory in $\mathcal{S}$.
Hence, $A$ depicts the dynamics of the attribute sequence.

3.3.  Bag  of  Words  for  Attribute  Dynamics 

While substantially more descriptive than the holistic
attribute model of [15], the BDS of [14] still has two serious
limitations as a model of video dynamics. These are
illustrated in Figure 1. First, there is, in general, no guarantee
that the whole video sequence depicts the activity
of interest. On the contrary, the segments that matter for
event recognition (e.g., a segment of “tennis-serve”) are
frequently surrounded by segments that are not informative for
the recognition (e.g., video of subsequent plays). Fitting a
single dynamic model to long video sequences will lead to
parameter estimates that are not representative of the event
of interest. Second, since complex activities are composed
of several atomic actions, sometimes disjoint in time, their
state trajectories are unlikely to follow the Gauss-Markov
process. Both of these limitations, however, are unlikely to
hold if the BDS is fitted to a short-term video segment.

On the other hand, most activities can be effectively
inferred by a characterization of the short-term segments that
compose them. For example, the characterization of the
activity “long-jump” by the attribute sequence “run-run”,
“run-jump” and “jump-land”, is sufficient to discriminate
it from the (very similar) activity “triple-jump”, if the latter
is characterized by the attribute sequence “run-jump”,
“jump-jump” and “jump-land”. The presence (or absence)
of a video segment with attributes “jump-jump” is sufficient
to discriminate between the two activities. Based on these
observations, we propose to model video with an extension
of the BoVW that captures the short-term dynamics of the
attribute representation of an action.

A video sequence is first split into a collection of temporally
overlapping segments $\{s^{(i)}\}$. Segment $s^{(i)}$ has $\tau_i$
frames, which are fed to the attribute mapping of (1). This
produces a set of attribute score vectors $\Pi^{(i)} = \{\pi_t^{(i)}\}_{t=1}^{\tau_i}$,
which is denoted the attribute sequence of segment $s^{(i)}$.
The video sequence is finally represented by a bag of attribute
sequences (BoAS), which plays the role, in the proposed
framework, of the BoF in image classification. A
dictionary of representative BDSs $\{\Omega_i\}$, which are
denoted words for attribute dynamics (WAD), learned from
a set of training BoAS, is then used to quantize the BoAS
extracted from the video sequence to classify. The resulting
histogram of WAD counts, denoted a bag of words for attribute
dynamics (BoWAD), is finally used as a feature vector
for video classification. This representation is summarized
in Figure 3.
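The splitting of an attribute trajectory into overlapping short-term segments can be sketched as follows. This is a minimal illustration, not the authors' code: `split_into_segments` is a hypothetical helper, and the trajectory is assumed to be stored as an array with one attribute score vector per row.

```python
import numpy as np

def split_into_segments(scores, length, step):
    """Split an attribute trajectory into overlapping segments.

    scores : (T, K) array, one attribute score vector per time step
    length : number of consecutive score vectors per segment
    step   : shift, in time steps, between consecutive segments
    Returns a list of (length, K) arrays; a trailing partial segment is dropped.
    """
    T = scores.shape[0]
    starts = range(0, max(T - length, 0) + 1, step)
    return [scores[s:s + length] for s in starts if s + length <= T]
```

For instance, a trajectory of 30 score vectors split with `length=12` and `step=3` yields 7 segments, starting at time steps 0, 3, ..., 18.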

4.  Learning  and  Recognition  with  BoWADs 

In Section 5 we will show that, when combined with standard
histogram-based classifiers, e.g., support vector machines
(SVMs) with a histogram intersection kernel (HIK),
BoWADs are a very effective representation for the recognition
of complex activities. For now, we address the problem
of quantizing attribute sequences. We start with the problem
of learning a WAD dictionary.
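For reference, the histogram intersection kernel between two sets of histograms can be computed as in the sketch below. The function name is a hypothetical choice; the definition used, $K(h, h') = \sum_k \min(h_k, h'_k)$, is the standard one, and the resulting Gram matrix could be passed to any SVM solver that accepts precomputed kernels.

```python
import numpy as np

def histogram_intersection_kernel(H1, H2):
    """Gram matrix with entries K[i, j] = sum_k min(H1[i, k], H2[j, k]).

    H1 : (n1, V) array of histograms (one per row)
    H2 : (n2, V) array of histograms
    """
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=2)
```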

4.1.  Clustering  Samples  in  the  Model  Domain 

Traditional clustering (e.g., k-means) searches for prototypes
in the space of training samples (e.g., in k-means, a
cluster prototype is the centroid of the samples in the cluster),
using a metric suited for that space (e.g., Euclidean
distance). An extension to the clustering of BoAS is not
straightforward because 1) attribute sequences can have
different lengths; 2) the space of these sequences has
non-Euclidean geometry; and 3) the search for optimal prototypes,
under this geometry, may lead to intractable nonlinear
optimization. More importantly, because we are interested
in characterizing the appearance and dynamics of
attribute sequences, it is more desirable to find a set of
prototype BDSs than a set of prototype sequences.

This becomes a problem of learning a bag-of-models
(BoM) where, given a set of training samples
$\mathcal{D} = \{z_i\}_{i=1}^{N}$ ($z_i \in \mathcal{Z}, \forall i$), the goal is to learn a dictionary of
representative models $\{M_i\}_{i=1}^{N_c}$ in a model space $\mathcal{M}$. The
proposed solution is based on two mappings. The first

$f_M : \mathcal{Z} \supseteq \{z_i\} \mapsto M(\{z_i\}) \in \mathcal{M}$  (7)

maps a collection of examples $\{z_i\} \subset \mathcal{D}$ into a model
$M(\{z_i\})$. The second,

$d_M : \mathcal{M} \times \mathcal{M} \ni (M_1, M_2) \mapsto d_M(M_1, M_2) \in \mathbb{R}_+$  (8)

is a measure of distance between models. The mapping
of (7) is first used to produce a model $M(z_i)$ per training
example $z_i$. Training samples are then clustered, at
the model level, by alternating between two steps. In the
assignment step, each $z_i$ is assigned to the cluster whose
model is closest to $M(z_i)$, using the metric (8). In the
model refinement step, the model associated with each cluster
is relearned from the training samples assigned to it,
via (7). This procedure is summarized in Algorithm 1 and
denoted bag-of-models clustering (BMC).

Algorithm 1: Bag-of-Models Clustering

Input: a set of samples $\mathcal{D} = \{z_i\}_{i=1}^{N}$ ($z_i \in \mathcal{Z}, \forall i$),
    number of clusters $N_c$, an initial set of
    models $\{M_i^{(0)}\}_{i=1}^{N_c}$.
set $t = 0$ and $S_i^{(0)} = \emptyset$, $i = 1, \ldots, N_c$;
repeat
    $t = t + 1$;
    Assignment-Step: $\forall i$, $S_i^{(t)} = \{ z \in \mathcal{D} \mid \forall j \neq i, \; d_M(M(z), M_i^{(t-1)}) \leq d_M(M(z), M_j^{(t-1)}) \}$;
    Refinement-Step: $\forall i$, $M_i^{(t)} = M(\{S_i^{(t)}\})$
until $\forall i$, $S_i^{(t)} = S_i^{(t-1)}$;
Output: $\{M_i^{(t)}\}_{i=1}^{N_c}$ and $\{S_i^{(t)}\}_{i=1}^{N_c}$

Figure 3: BoWAD representation of the activity “diving-springboard”. (Top) video sequence. (Middle) the holistic vector of attribute
scores is now represented as a trajectory in the attribute space (which is four dimensional, in this example, and represented as four colored
functions). The trajectory is split into overlapping short-term segments. (Bottom) each segment is assigned to the WAD associated with the
BDS, in a learned BDS dictionary, that best explains it. Dictionary BDSs are models of short-term behaviors, such as “walk-walk-jump”,
“walk-jump-jump”, “jump-jump-somersault” and “jump-somersault-enter water”. The activity is represented by a BoWAD, which is a
histogram of assignments of segments to WADs.

BMC generalizes k-means, where $z_i \in \mathbb{R}^d$ are feature
vectors, $\mathcal{M}$ is the family of Gaussians of identity covariance

$\mathcal{M} = \{ p(z; \mu) = \mathcal{G}(z; \mu, I_d) \mid \mu \in \mathbb{R}^d \}$,  (9)

(7) selects the model

$M(\{z_i\}) = \mathcal{G}(z; \hat{\mu}, I)$,  (10)

where $\hat{\mu}$ is the maximum likelihood estimate of the mean

$\hat{\mu} = \arg\max_{\mu} p(\{z_i\}; \mu) = \frac{1}{|\{z_i\}|} \sum_{i} z_i$,  (11)

and the measure of (8) is the (symmetric) Kullback-Leibler
divergence

$KL(p_1 \| p_2) + KL(p_2 \| p_1) = \| \mu_1 - \mu_2 \|^2$.  (12)
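The alternation between assignment and refinement, and its reduction to k-means, can be illustrated with the following sketch. It is a hedged reading of Algorithm 1, not the authors' implementation: the function `bmc`, its arguments, the `max_iter` safeguard, and the choice to keep the previous model when a cluster becomes empty are all assumptions made for illustration.

```python
import numpy as np

def bmc(samples, fit, dist, init_models, max_iter=100):
    """Bag-of-models clustering (sketch of Algorithm 1).

    samples     : list of training samples z_i
    fit         : callable mapping a list of samples to a model, as in Eq. (7)
    dist        : callable returning a distance between two models, as in Eq. (8)
    init_models : list of N_c initial models
    """
    sample_models = [fit([z]) for z in samples]  # one model M(z_i) per sample
    models = list(init_models)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample goes to the cluster with the closest model.
        new_labels = [min(range(len(models)), key=lambda j: dist(m, models[j]))
                      for m in sample_models]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Refinement step: refit each cluster model on the samples assigned to it.
        for j in range(len(models)):
            members = [z for z, lab in zip(samples, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its previous model
                models[j] = fit(members)
    return models, labels

# k-means as a special case: a model is a mean vector, as in Eq. (11),
# and the distance is the squared Euclidean distance of Eq. (12).
fit_mean = lambda zs: np.mean(zs, axis=0)
sq_dist = lambda m1, m2: float(np.sum((m1 - m2) ** 2))
```

Calling `bmc(points, fit_mean, sq_dist, init)` on a list of vectors then behaves like Lloyd's k-means; replacing `fit` by a BDS learner and `dist` by a distance between BDSs gives the dictionary learning procedure discussed in the text.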

It should be noted that BMC (Algorithm 1) differs from
the bag-of-systems method of [21, 1] in two ways. First,
it clusters attribute sequences rather than the models themselves,
as is done by [21, 1]. Note that, in the model refinement
step of Algorithm 1, models are re-learned from examples
$\{z_i\}$. The refinement step of [21, 1] only considers the
parameters of the models $M(z_i)$ and not the examples $z_i$
themselves. This usually entails loss of information. Second,
Algorithm 1 finds the optimal representative for each
cluster, according to the model fitting criterion of (7). In
[21], the difficult geometry of the manifold defined by the
LDS parameter tuple $(A, C) \in GL(n) \times ST(p, n)$, where
$GL(n)$ is the set of invertible matrices of size $n$ and $ST(p, n)$
the Stiefel manifold of $p \times n$ orthonormal matrices ($p > n$),
precludes a simple estimate of the optimal representative.
Instead, this is approximated by searching for the model
$M(z_i)$ closest to the optimal representative. Although [1]
introduce an approach to directly cluster LDSs in their parameter
space, its generalization to BDSs is not yet clear.
We will show, in Section 5, that these differences can
lead to significantly improved performance by Algorithm 1.

4.2.  Learning  a  Vocabulary  of  WADs 

A WAD dictionary is learned by applying Algorithm 1
to a BoAS $\mathcal{D} = \{\Pi^{(i)}\}_{i=1}^{N}$, as follows.


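Algorithm 2: Learning a Cluster for WADs Dictionary

Input: a set of $n$ sequences of attribute score vectors
    $\{\{\pi_t^{(i)}\}_{t=1}^{\tau_i}\}_{i=1}^{n}$, state space dimension $L$.
Binary PCA:
    $\{C, X, u\} = \text{B-PCA}(\{\{\pi_t^{(i)}\}_{t=1}^{\tau_i}\}_{i=1}^{n}, L)$ [24],
Estimate state parameters:
    $A = X_2^{\tau} (X_1^{\tau-1})^{\dagger}$,  $V = X_2^{\tau} - A X_1^{\tau-1}$,
    (where $X_2^{\tau} = [(X^{(1)})_2^{\tau_1}, \ldots, (X^{(n)})_2^{\tau_n}]$,
    $X_1^{\tau-1} = [(X^{(1)})_1^{\tau_1 - 1}, \ldots, (X^{(n)})_1^{\tau_n - 1}]$,
    and $(X^{(i)})_{t_1}^{t_2} = [x_{t_1}^{(i)}, \ldots, x_{t_2}^{(i)}]$).
    $Q = \frac{1}{\sum_{i=1}^{n} (\tau_i - 1)} V V^T$,  $\mu_0 = \frac{1}{n} \sum_{i=1}^{n} x_1^{(i)}$,
    $S_0 = \frac{1}{n} \sum_{i=1}^{n} (x_1^{(i)} - \mu_0)(x_1^{(i)} - \mu_0)^T$.
Output: $\Omega = \{A, C, Q, u, \mu_0, S_0\}$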
Refinement-Step: The mapping of (7) amounts to fitting
a BDS to a BoAS $\mathcal{D}' = \{\Pi^{(i)}\} \subset \mathcal{D}$. This is done
with recourse to Algorithm 2, which extends the algorithm
of [14] for learning a BDS from a single attribute sequence.
The extension follows the two-step decomposition of BDS
learning discussed in Section 3.2. A binary PCA is first
applied to all attribute score vectors in $\mathcal{D}'$. The parameters
of the hidden Gauss-Markov process are then learned
by solving a least squares problem involving all latent state
sequences returned by binary PCA. In this way, the BDS
learned per cluster jointly characterizes the appearance and
dynamics of all attribute sequences in that cluster.
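The pooled least squares step can be sketched as follows, assuming the latent state sequences have already been obtained from binary PCA (not shown). The function name is hypothetical, and the normalization of the noise covariance by the number of pooled transitions is an assumption of this sketch rather than a detail taken from the source.

```python
import numpy as np

def estimate_state_parameters(latent_seqs):
    """Fit one Gauss-Markov state model to several latent sequences.

    latent_seqs : list of (L, tau_i) arrays, one latent trajectory per sequence
    Returns the transition matrix A, noise covariance Q and initial mean mu0.
    """
    # Pool all transitions x_t -> x_{t+1}, never pairing states across sequences.
    X_next = np.hstack([X[:, 1:] for X in latent_seqs])
    X_prev = np.hstack([X[:, :-1] for X in latent_seqs])
    A = X_next @ np.linalg.pinv(X_prev)  # least squares fit of x_{t+1} = A x_t
    V = X_next - A @ X_prev              # residuals, i.e. state noise samples
    Q = V @ V.T / V.shape[1]             # assumed normalization: number of transitions
    mu0 = np.mean([X[:, 0] for X in latent_seqs], axis=0)
    return A, Q, mu0
```

Because the transitions of all sequences in a cluster enter a single least squares problem, sequences of different lengths contribute in proportion to the number of transitions they contain.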

Assignment-Step: As a measure of distance between
two BDSs, we use the Binet-Cauchy (BC) kernel. This was
originally proposed in [26] as a measure of dissimilarity between
infinite output sequences of two LDSs, and adapted
to a measure of the dissimilarity between the outputs of two
BDSs, $\Omega_a$ and $\Omega_b$, in [14]. It is defined as

$d_{BC}(\Omega_a, \Omega_b) = E_v \left[ \sum_{t=0}^{\infty} e^{-\lambda t} \left( KL(\mathcal{B}(\sigma(\theta_t^{(a)})) \| \mathcal{B}(\sigma(\theta_t^{(b)}))) + KL(\mathcal{B}(\sigma(\theta_t^{(b)})) \| \mathcal{B}(\sigma(\theta_t^{(a)}))) \right) \right]$
$\quad = E_v \left[ \sum_{t=0}^{\infty} e^{-\lambda t} \left( \sigma(\theta_t^{(a)}) - \sigma(\theta_t^{(b)}) \right)^T \left( \theta_t^{(a)} - \theta_t^{(b)} \right) \right]$,  (13)

where $\{\sigma(\theta_t^{(a)})\}$ and $\{\sigma(\theta_t^{(b)})\}$ are the parameters of the
multivariate Bernoulli distributions from which the binary
attribute vectors are sampled, for the two BDSs. While the
BC kernel between two LDSs can be computed in closed
form, the evaluation of (13) is not trivial. Like the latent
state sequence $\{x_t\}$, its linear projection $\{\theta_t\}$ is a sample
from a high-dimensional Gaussian distribution. Hence, (13)
amounts to computing the expectation of a nonlinear function
with respect to a multivariate Gaussian distribution, and
is intractable in general. Following [14], we resort to a numeric
solution which approximates the summation by a finite
number of terms. This has been empirically shown to
produce good results.
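One way to approximate (13) numerically is a Monte Carlo estimate with a truncated sum, sketched below. This is an illustration under several assumptions and not necessarily the numeric scheme of [14]: both systems are driven by the same noise realization, the initial state is fixed at its mean, both systems share the latent dimension, and each BDS is stored as a plain dictionary with hypothetical keys `A`, `C`, `u`, `Q` and `mu0`.

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def symmetric_kl(theta_a, theta_b):
    # Symmetric KL divergence between two multivariate Bernoulli distributions
    # with natural parameters theta_a and theta_b (the summand of Eq. 13).
    return float((sigmoid(theta_a) - sigmoid(theta_b)) @ (theta_a - theta_b))

def bc_distance(bds_a, bds_b, lam=0.9, horizon=50, n_samples=100, seed=0):
    """Monte Carlo estimate of Eq. (13), with the sum truncated at `horizon`."""
    rng = np.random.default_rng(seed)
    L = bds_a["A"].shape[0]
    chol_a = np.linalg.cholesky(bds_a["Q"])
    chol_b = np.linalg.cholesky(bds_b["Q"])
    total = 0.0
    for _ in range(n_samples):
        # Assumption: both trajectories start at their initial means.
        x_a, x_b = bds_a["mu0"].copy(), bds_b["mu0"].copy()
        for t in range(horizon):
            theta_a = bds_a["C"] @ x_a + bds_a["u"]
            theta_b = bds_b["C"] @ x_b + bds_b["u"]
            total += np.exp(-lam * t) * symmetric_kl(theta_a, theta_b)
            noise = rng.standard_normal(L)  # shared noise realization (assumption)
            x_a = bds_a["A"] @ x_a + chol_a @ noise
            x_b = bds_b["A"] @ x_b + chol_b @ noise
    return total / n_samples
```

Under these assumptions the estimate is non-negative, symmetric in its two arguments, and exactly zero when both arguments are the same system.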

4.3.  Quantization 

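Given a WAD dictionary $\{\Omega_j\}$, a BoAS
$\{\{\pi_t^{(i)}\}_{t=1}^{\tau_i}\}_{i=1}^{N}$ is quantized by assigning the $i$-th
attribute sequence to the $k^*$-th cluster according to

$k^* = \arg\min_j d_{BC}(\Omega(\{\pi_t^{(i)}\}_{t=1}^{\tau_i}), \Omega_j)$,  (14)

where $\Omega(\{\pi_t^{(i)}\}_{t=1}^{\tau_i})$ is the BDS learnt from $\{\pi_t^{(i)}\}_{t=1}^{\tau_i}$
using (7).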
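The quantization rule of (14), followed by histogram accumulation, can be sketched as below. The helper name and the generic `dist` argument are hypothetical; in the setting of this paper the models would be BDSs and `dist` the distance of (13), but the sketch is agnostic to the model type.

```python
import numpy as np

def bowad_histogram(segment_models, dictionary, dist):
    """Histogram of codeword counts for one video.

    segment_models : list of models, one fitted per attribute sequence
    dictionary     : list of codeword models
    dist           : callable returning a distance between two models
    """
    counts = np.zeros(len(dictionary))
    for m in segment_models:
        # Eq. (14): index of the closest codeword.
        k_star = min(range(len(dictionary)), key=lambda j: dist(m, dictionary[j]))
        counts[k_star] += 1
    return counts
```

The returned vector holds raw counts; whether and how it is normalized before classification is left open here.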

5.  Experiments 

A number of experiments were performed to compare
the BoWAD representation to previous models of temporal
activity structure. The low-level representation used in all
experiments was the BoF of []  ]. A set of spatio-temporal
interest points (STIPs) were first detected, a feature vector
was extracted from the support of each interest point,
and quantized into a vocabulary learnt from the training set.
Binary SVMs using histogram intersection kernel (HIK)
with probability outputs [2  ] were used as attribute models,
learned from annotated training video clips (see supplementary
material for attribute definitions). In all experiments,
BDS and BoWADs used a 5-dimensional state space.

Table 1: Accuracy on Weizmann Activity.

Sets      BoF     BoF-TP [11]   Attribute [15]   BDS [14]   BoWAD (MDS-kM)   BoWAD (BMC)
Syn20×1   23.3%   36.7%         17.8%            64.4%      100%             100%
Syn10×2   28.9%   31.1%         16.7%            65.6%      98.9%            100%

5.1.  Weizmann  Activity 

The first set of experiments was based on composite
sequences synthesized from the Weizmann dataset [8], which
contains 10 atomic action classes, performed by 9 people,
for a total of 90 samples. BoWAD was compared to the
vanilla  BoF,  BoF  with  £3  temporal  pyramids  [11]  (denoted 
“BoF-TP”), holistic attributes [15] (denoted “Attribute”)
and BDS [14]. Attribute sequences were computed over
30-frame sliding video windows with a 10-frame step. As in [14],
30 low-level attributes were defined for the original 10 actions.
To compute BoWADs, each short-term attribute sequence
consisted of the attribute vectors from 12 consecutive
windows, extracted with a step of 3 windows. WAD
dictionaries were learned with both BMC and the MDS-kM
algorithm of [21]. One-vs-all SVMs with HIK were used
in all histogram-based methods (BoF, BoF-TP, BoWAD,
attribute models), where STIP features used a 1000-word
vocabulary. For BDS, we used the kernel $K(\Omega_a, \Omega_b) =
\exp(-\gamma \, d_{BC}^2(\Omega_a, \Omega_b))$ (same for the rest of experiments).

Two datasets were created. The first, “Syn20×1”, aimed
to test the ability of the different approaches to detect activity
classes of large variability. An activity was defined as a
sequence of 20 consecutive atomic actions from Weizmann.
This sequence was inserted at a random temporal location
of a larger sequence of 40 atomic actions. The remaining 20
actions in the larger sequence were randomly selected from
Weizmann. The second, “Syn10×2”, tested the ability of
the different approaches to detect discontinuous activities.
In this case, each activity was defined by two subsequences,
each with 10 consecutive atomic actions. The two subsequences
were randomly inserted at non-overlapping locations
of the larger (40 atomic action) sequence.
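The construction of a “Syn20×1”-style composite sequence can be sketched at the level of atomic action labels as follows. This is an illustrative reading of the protocol, not the script used for the experiments: the function name, the integer label encoding and the fixed random seed are assumptions, and the assembly of the actual video clips is not shown.

```python
import random

def make_syn_sequence(activity, n_atomic=10, total_len=40, rng=None):
    """Embed an activity at a random location of a longer label sequence.

    activity  : list of atomic action labels defining the activity
    n_atomic  : number of atomic action classes to draw filler labels from
    total_len : length of the composite sequence, in atomic actions
    Returns the composite sequence and the start index of the activity.
    """
    rng = rng or random.Random(0)
    n_fill = total_len - len(activity)
    start = rng.randrange(n_fill + 1)  # random temporal location
    filler = [rng.randrange(n_atomic) for _ in range(n_fill)]
    return filler[:start] + list(activity) + filler[start:], start
```

A “Syn10×2”-style sequence would instead insert two 10-action subsequences at non-overlapping locations, which requires sampling two insertion points rather than one.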

Table  1  summarizes  the  performance  of  the  different 
methods.  The  very  weak  performance  of  BoF,  BoF-TP,  and 
Table 2: Average Precisions for Activity Recognition on Olympic Sports Dataset.

Activity         Laptev     Niebles   Tang      Attribute   BDS      BoWAD     BoWAD
                 et al.     et al.    et al.    [15]        [14]     MDS-kM    BMC
                 [11]       [16]      [25]                           [21]
                 (BoF-TP)

high-jump        52.4%      68.9%     18.4%     93.2%       82.2%    86.8%     83.9%
long-jump        66.8%      74.8%     81.8%     82.6%       92.5%    83.9%     91.9%
triple-jump      36.1%      52.3%     16.1%     48.3%       52.1%    64.2%     75.7%
pole-vault       47.8%      82.0%     84.9%     74.4%       79.4%    68.0%     76.5%
gym. vault       88.6%      86.1%     85.7%     86.7%       83.4%    86.7%     91.4%
shot-put         56.2%      62.1%     43.3%     76.2%       70.3%    58.0%     79.4%
snatch           41.8%      69.2%     88.6%     71.6%       72.7%    56.4%     73.4%
clean-jerk       83.2%      84.1%     78.2%     79.4%       85.1%    78.2%     85.4%
javelin throw    61.1%      74.6%     79.5%     62.1%       87.5%    56.6%     76.7%
ham. throw       65.1%      77.5%     70.5%     65.5%       74.0%    71.3%     79.2%
discus throw     37.4%      58.5%     48.9%     68.9%       57.0%    62.6%     66.9%
diving-plat.     91.5%      87.2%     93.7%     77.5%       86.0%    85.2%     82.0%
diving-sp. bd.   80.7%      77.2%     79.3%     65.2%       78.3%    75.2%     82.3%
bask. layup      75.8%      77.9%     85.5%     66.7%       78.1%    66.6%     60.8%
bowling          66.7%      72.7%     64.3%     72.0%       52.5%    64.4%     73.0%
tennis-serve     39.6%      49.1%     49.6%     55.2%       38.7%    68.1%     73.2%

mean AP          62.0%      72.1%     66.8%     71.6%       73.2%    70.8%     78.2%

Figure 4: Mean average precision (mAP) vs. size of the BDS dictionary
(number of BDS codewords) on Olympic Sports. Vertical bars indicate the
standard deviation of mAP in cross-validation.


Attribute shows that modeling of activity dynamics is critical
for success on these datasets. While the BDS substantially
improves performance, its underlying assumption of
a single dynamic process is a limitation for these sequences,
where the activities of interest are not temporally aligned
and are surrounded by irrelevant video. Substantially better
performance is achieved with the BoWAD representation,
which has perfect performance on these datasets. Both
clustering strategies achieve good results, although BMC
slightly outperforms MDS-kM.

5.2.  Olympic  Sports 

The second set of experiments was conducted on the
Olympic Sports dataset [16]. The performance of BoWADs,
learned with BMC and MDS-kM, was compared to BoF-TP [11],
activity models with decomposable segments [16],
the hidden Markov model with latent states of variable
duration of [25], the holistic attribute representation of [15], and
the BDS [14]. In all cases, a 3000-word STIP vocabulary
was used to quantize low-level features. BDS and BoWAD
used the 40 attributes defined by [15]. A 30-frame sliding
video window, with a step of 4 frames, was used to compute
attribute scores. For the BoWAD, attribute sequences
consisted of 12 consecutive attribute vectors, with a 75%
overlap between consecutive sequences. Performance was
measured with per-category average precision (AP) and mean
AP, using 5-fold cross-validation.
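
The windowing described above can be made concrete with a short sketch. The function names and the 300-frame example are assumptions for illustration; only the window length (30 frames), step (4 frames), sequence length (12 attribute vectors), and overlap (75%) come from the text.

```python
def window_starts(n_items, length, step):
    """Start indices of all full windows of `length` items, taken every `step` items."""
    return list(range(0, n_items - length + 1, step))

def attribute_sequence_spans(n_frames, win=30, win_step=4, seq_len=12, overlap=0.75):
    """Frame spans (start, exclusive end) covered by each attribute sequence."""
    # One attribute vector is computed per sliding video window.
    frame_starts = window_starts(n_frames, win, win_step)
    # Consecutive sequences share `overlap` of their vectors, so the hop
    # between sequences is seq_len * (1 - overlap) vectors.
    hop = max(1, round(seq_len * (1 - overlap)))
    spans = []
    for s in window_starts(len(frame_starts), seq_len, hop):
        first = frame_starts[s]
        last = frame_starts[s + seq_len - 1] + win
        spans.append((first, last))
    return spans

# Hypothetical 300-frame video.
spans = attribute_sequence_spans(300)
```

With these settings the hop is 3 attribute vectors (12 video frames), and each attribute sequence covers (12 - 1) * 4 + 30 = 74 frames of video.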

As shown in Table 2, the BoWAD again achieves the
best results. In fact, it achieves the best results reported
in the literature on this dataset with similar low-level
features (STIP). This includes methods based on much
more sophisticated classifiers, such as the 74.4% of [T ]
or the 76.5% of [1 ], which use latent SVMs or multiple
kernel classifiers to combine supervised, unsupervised
attributes (dynamics), and low-level features. The BoWAD
achieves 78.2% by simply quantizing attribute dynamics.
It works particularly well for categories, such as
“tennis-serve”, which have large variability and tend to include
video irrelevant for activity detection, and for category pairs,
such as “triple-jump” and “long-jump”, that differ in subtle
ways. The robustness inherent to a vocabulary of dynamics
is critical for the former (compare the 73.2% of BoWAD-BMC
with the 38.7% of BDS on “tennis-serve”), while the
detailed characterization of attribute dynamics is critical for
the latter (75.7% vs. the 48.3% of Attribute on “triple-jump”).
With regard to clustering algorithms, there is now a
substantial gap between MDS-kM (70.8%) and BMC (78.2%).
Figure 4 shows that this difference holds across a large
range of WAD dictionary sizes. The robustness of the
proposed representation is reinforced by the fact that a
320-word BoWAD has mAP (75%) superior to all other
representations of Table 2.

5.3.  TRECVID-MED11 

The third set of experiments used the 2011 TRECVID
multimedia event detection (MED) open source
dataset [17]. The event collection (EC) set was used
for training and the development set (DEVT) for testing
(events 1-5). EC contains 2,062 training samples of 5
high-level events, with 100-200 positive examples per
event. DEVT has around 11,000 samples. We manually
defined 93 attributes and used a 10,000-word low-level
feature dictionary. Attribute scores were computed with
a 180-frame sliding window with steps of 30 frames, and
attribute sub-sequences (r = 10) were extracted every
window. BoWAD used a dictionary of size 1000.

The performance of the different methods is summarized
in Table 3. On this highly challenging dataset, the
Table 3: Average Precision for Event Detection on TRECVID MED11 DEVT Dataset.

Event                       Random   Laptev     Niebles   Tang et al. [25]     Attribute   BDS      BoWAD     BoWAD
(E001-E005)                 Guess    et al.     et al.    (d = 1 / d ≤ dmax)   [15]        [14]     MDS-kM    BMC
                                     [11]       [16]                                                [21]
                                     (BoF-TP)

attempting a board trick    1.18%    8.22%      5.84%     6.24% / 15.44%       18.91%      8.41%    26.62%    29.99%
feeding an animal           1.06%    2.54%      2.28%     5.28% / 3.55%        4.95%       1.78%    4.61%     7.36%
landing a fish              0.89%    9.77%      9.18%     7.30% / 14.02%       24.17%      6.20%    24.97%    28.10%
wedding ceremony            0.86%    5.52%      7.26%     9.48% / 15.09%       16.68%      12.24%   22.15%    22.39%
working on a wood project   0.93%    4.09%      4.05%     3.42% / 8.17%        5.11%       5.08%    12.39%    18.32%

mean AP                     0.98%    6.01%      5.72%     6.34% / 11.25%       13.96%      6.74%    18.15%    21.23%


gap between BoWAD and the other representations is
enormous. In fact, the BoWAD learned by BMC (21.23%)
almost doubles the best previous result in the literature among
methods that model the temporal structure of complex events
(i.e., the 11.25% of [25]). The fact that the BoWAD substantially
outperforms the BDS also confirms the observation that the
robustness of a vocabulary of local attribute dynamics is
critical for accurate detection of complex activities. For
example, events in the class “attempting a board trick” include a
repetition of local actions, e.g., “slide-jump-(somersault)-land-slide”.
While it is difficult to model this sequence
as a whole, due to the large variability of cutting in different
videos, it is much easier to capture short-term signature
actions, such as “slide-jump”, which are usually not broken
during video editing. Finally, with respect to clustering
algorithms, BMC again substantially outperforms MDS-kM.
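
As a worked check of the comparison above, the mean AP values quoted in the text can be recomputed from the per-event entries of Table 3. The variable names are illustrative; the numbers are copied from the BoWAD-BMC column and from the second value of the Tang et al. [25] column.

```python
def mean_ap(aps):
    """Mean average precision: the arithmetic mean of per-event APs."""
    return sum(aps) / len(aps)

# Per-event APs (%) for events E001-E005, copied from Table 3.
bowad_bmc = [29.99, 7.36, 28.10, 22.39, 18.32]
tang_second = [15.44, 3.55, 14.02, 15.09, 8.17]

map_bowad = mean_ap(bowad_bmc)
map_tang = mean_ap(tang_second)
ratio = map_bowad / map_tang

print(f"BoWAD-BMC mAP: {map_bowad:.2f}%")
print(f"Tang et al. mAP: {map_tang:.2f}%")
print(f"ratio: {ratio:.2f}")
```

The recomputed means round to the 21.23% and 11.25% reported in the table, and their ratio is a little under 1.9, which is consistent with the "almost doubles" statement.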

6.  Conclusion 

In this work, we proposed a novel solution to the problem
of modeling attributes and dynamics for activity
recognition. The method combines the advantages, in terms
of robustness, of histogram-based representations with the
power of BDSs to model the dynamics of video attributes.
We developed new algorithms for learning BDS dictionaries
and quantizing video with them. The proposed
representation significantly outperforms other state-of-the-art
attribute-based or temporal-structure-modeling approaches
in complex activity recognition.